Transformations and Constrasts in Multiple Regression
Author
Reinhold Kliegl
1 Setup
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Code
library(performance)library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
Code
library(hypr)
2 Transformation after checking model residuals
2.1 Analyses of six-minute run: No need for transformation
Call:
lm(formula = score ~ 1 + age + Sex + cohort, data = dat_star)
Residuals:
Min 1Q Median 3Q Max
-8.0146 -2.5613 -0.5408 1.9885 18.4222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -199.25525 100.81191 -1.977 0.04840 *
age -1.33898 0.44055 -3.039 0.00244 **
SexBoys -0.49168 0.25611 -1.920 0.05520 .
cohort 0.11738 0.05032 2.333 0.01987 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.84 on 896 degrees of freedom
Multiple R-squared: 0.01821, Adjusted R-squared: 0.01492
F-statistic: 5.54 on 3 and 896 DF, p-value: 0.0009052
Code
check_model(m1_star)
Code
boxCox(m1_star)
Tukey’s ladder of power transformations for lambdas: + lambda = -1, then 1/score
+ lambda = -1/2, then 1/sqrt(score) + [lambda = 0, then log(score)] – special case because sccore^0 = 1 + lambda = +1/2, then sqrt(score) + lambda = 1, then no transformation
Code
dat_star$score_r =51/dat_star$scorem1_star_r <-lm(score_r ~1+ age + Sex + cohort, data=dat_star)summary(m1_star_r)
Call:
lm(formula = score_r ~ 1 + age + Sex + cohort, data = dat_star)
Residuals:
Min 1Q Median 3Q Max
-0.91937 -0.18806 0.00168 0.18231 0.92385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.102636 7.521905 2.540 0.01127 *
age 0.107257 0.032871 3.263 0.00114 **
SexBoys 0.043998 0.019109 2.302 0.02154 *
cohort -0.008935 0.003754 -2.380 0.01752 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2865 on 896 degrees of freedom
Multiple R-squared: 0.02158, Adjusted R-squared: 0.01831
F-statistic: 6.589 on 3 and 896 DF, p-value: 0.0002092
Code
check_model(m1_star_r)
Code
boxCox(m1_star_r)
3 Factors and covariates
The following analysis assume a balanced design, that is equal number of children in each of the Sex x Cohort cells of the design.
Contrasts are needed for factors; some form of centering is often needed for covariates.
3.1 Factor Sex
3.1.1 Treatment coding
If you don’t specify a contrast, R will specify a treatment contrast for you. This is the default setting.
# A tibble: 2 × 4
Sex N M SD
<fct> <int> <dbl> <dbl>
1 Girls 450 976. 134.
2 Boys 450 1040. 152.
Code
(Boys_Girls <- tbl_sex$M[2] - tbl_sex$M[1]) # Boys - girls
[1] 64.76667
Code
(GrandMean <-mean(dat_run$score))
[1] 1008.106
(Intercept) is mean of group with level 0: Girls.
Sex2 is difference of Boys’ and Girls’ means.
Note: When the data are balanced, you don’t need multiple regression (MR) to estimate effects; you need MR to estimate SEs of effects, that is to test the significance of the difference.
Call:
lm(formula = score ~ 1 + cohort, data = dat_run)
Residuals:
Min 1Q Median 3Q Max
-437.76 -95.42 3.07 98.38 468.58
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12752.533 3797.832 3.358 0.000819 ***
cohort -5.828 1.885 -3.092 0.002047 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 146 on 898 degrees of freedom
Multiple R-squared: 0.01054, Adjusted R-squared: 0.009435
F-statistic: 9.563 on 1 and 898 DF, p-value: 0.002047
The (Intercept) estimate looks very different. It is the Grand Mean when cohort has the value 0, that is around the first Christmas in history. This is not very plausible, or course.
It is interesting to compare m3_run and m4_run_sqd.
Code
anova(m3_run, m4_run_sqd)
Analysis of Variance Table
Model 1: score ~ 1 + cohort
Model 2: score ~ 1 + Cohort
Res.Df RSS Df Sum of Sq F Pr(>F)
1 898 19140278
2 891 18966199 7 174080 1.1683 0.3184
Thus, it was a bit of waste to spend eight degrees of freedom for Cohort. We can get the same goodness of fit with 1 df for cohort.
3.3.3 Hypothesis coding
The most common contrasts [contr.sum(), MASS::contr.sdif(), contr.treatment()] are available as R functions. If you want to set up your own contrast, not covered by the pre-programmed ones, see: